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Q Abstract 

New algorithms are devised for finding the maxima of multidimensional point samples, one of the very first 
problems studied in computational geometry. The algorithms are very simple and easily coded and modified 
I for practical needs. The expected complexity of some measures related to the performance of the algorithms is 

J> analyzed. We also compare the efficiency of the algorithms with a few major ones used in practice, and apply 

CN our algorithms to find the maximal layers and the longest common subsequences of multiple sequences. 

o\ 

Key words. Maximal points, computational geometry, Pareto optimality, sieve algorithms, dominance, multi- 
objective optimization, skyline, average-case analysis of algorithms. 

o 

1 Introduction 

. ^ A point p G l d is said to dominate another point q G R d if the coordinatewise difference p — q has only nonnegative 
^ coordinates and p — q is not identically a zero vector, where the dimensionality d > 1. For convenience, we write 
q -< p or p >- q. The non-dominated points in a sample are called the maxima or maximal points of that sample. 
Note that there may be two identical points that are both maxima according to our definition of dominance. Since 
there is no total order for multidimensional points when d > 1, such a dominance relation among points has been 
one of the simplest and widely used partial orders. We can define dually the corresponding minima of the sample 
by reversing the direction of the dominance relation. 



1.1 Maxima in diverse scientific disciplines 

Daily lives are full of tradeoffs or multi-objective decision problems with often conflicting factors; the numerous 
terms appeared in different scientific fields reveal the importance and popularity of maxima in theory, algorithms, 
applications and practice: maxima (or vector maxima) are sometimes referred to as nondominance, records, outer 
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layers, efficiency, or noninferiority but are more frequently known as Pareto optimality or Pareto efficiency (with 
the natural derivative Pareto front) in econometrics, engineering, multi-objective optimization, decision making, 
etc. Other terms used with essentially the same denotation include admissibility in statistics, Pareto front (and the 
corresponding notion of elitism) in evolutionary algorithms, and skyline in database language; see [2, 16, 22, 23] and 
the references therein and the books [19, 20, 26] for more information. They also proved useful in many computer 
algorithms and are closely related to several well-known problems, including convex hulls, top-A; queries, nearest- 
neighbor search, largest empty rectangles, minimum independent dominating set in permutation graphs, enclosure 
problem for rectilinear <i-gon, polygon decomposition, visibility and illumination, shortest path problem, finding 
empty slimplices, geometric containment problem, data swapping, grid placement problem, and multiple longest 
common subsequence problem to which we will apply our algorithms later; see [16, 46] for more references. 

We describe briefly here the use of maxima in the contexts of database language and multi-objective optimiza- 
tion problems using evolutionary algorithms. 

Skylines in database queries are nothing but minima. A typical situation where the skyline operator arises is 
as follows; see [14] for details. Travelers are searching over the Internet for cheap hotels that are near the beach 
in, say Cote d'Azur. Since the two criteria "lower price" and "shorter distance" are generally conflicting with each 
other and since there are often too many hotels to choose from, one is often interested in those hotels that are 
non-dominated according to the two criteria; here dominance is defined using minima. Much time will be saved if 
the search or sort engine can automatically do this and filter out those that are dominated for database queriers (by, 
say clicking at the skyline operator). On the other hand, frequent spreadsheet users would also appreciate such an 
operator, which can find the maxima, minima or skyline of multidimensional data by simple clicks. 

In view of these and many other natural applications such as e-commerce, multivariate sorting and data visual- 
ization, the skylines have been widely and extensively addressed in recent database literature, notably for low- and 
moderate-dimensional data, following the pioneering paper [14]. In addition to devising efficient skyline-finding 
algorithms, other interesting issues include top-A; representatives, progressiveness, absence of false hits, fairness, 
incorporation of preference, and universality. A large number of skyline-finding algorithms have been proposed for 
various needs; see, for example, [5, 14, 31, 42, 45, 50] and the references therein. 

On the other hand, an area receiving even much more recent attention is the study of multi-objective evolutionary 
algorithms (MOEAs), where the idea of maxima also appeared naturally in the form of non-dominated solutions 
(or elites). MOEAs provide a popular approach for multi-objective optimization, which identify the most feasible 
solutions lying on the Pareto front under various (often conflicting) constraints by repeatedly finding non-dominated 
solutions based on biological evolutionary mechanisms. These algorithms have turned out to be extremely fruitful 
in diverse engineering, industrial and scientific areas, as can be witnessed by the huge number of citations many 
papers on MOEA have received so far. Some popular schemes in this context suggested the maintenance of an 
explicit archive/elite for all non-dominated solutions found so far; see below and [27, 40, 51, 52] and the references 
therein. See also [18] for an interesting historical overview. 

Finally, maxima also arises in a random model for river networks (see [3, 10]) and in an interesting statistical 
estimate called "layered nearest neighbor estimate" (see [11]). 

1.2 Maxima, maximal layers and related notions 

Maxima are often used for some ranking purposes or used as a component problem for more sophisticated situations. 
Whatever the use, one can easily associate such a notion to define multidimensional sorting procedures. One of 
the most natural ways is to "peel off" the current maxima, regarded as the first-layer maxima, and then finding 
the maxima of the remaining points, regarded then as the second-layer maxima, and so on until no point is left. 
The total number of such layers gives rise to a natural notion of depth, which is referred to as the height of the 
corresponding random, partially ordered sets in [13]. Such a maximal-layer depth is nothing but the length of the 
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longest increasing subsequences in random permutations when the points are selected uniformly and independently 
from the unit square, a problem having attracted widespread interests, following the major breakthrough paper [4] . 

On the other hand, the maximal layers are closely connected to chains (all elements comparable) and antichains 
(all elements incomparable) of partially ordered set in order theory, an interesting result worthy of mention is the 
following dual version of Dilworth's theorem, which states that the size of the largest chain in a finite partial order 
is equal to the smallest number of antichains into which the partial order may be partitioned; see, for example, [38] 
for some applications. 

In addition to these aspects, maximal layers have also been widely employed in multi-objective optimization 
applications since the concept was first suggested in Goldberg's book [32]. Based on identifying the maximal 
layer one after another, Srinivas and Deb [48] proposed the non-dominated sorting genetic algorithm (NSGA) to 
simultaneously find multiple Pareto-optimal points, which was later on further improved in [21], reducing the time 
complexity from 0(dn 3 ) to 0(dn 2 ). (This paper has soon become highly-cited.) Jensen [37] then gave a divide- 
and-conquer algorithm to find the maximal layers with time complexity (n(logn) d_1 ); see Section 5 for more 
details. 

In the contexts of multi-objective optimization problems, elitism usually refers to the mechanism of storing 
some obtained non-dominated solutions into an external archive during the process of MOEAs because a non- 
dominated solution with respect to its current data is not necessarily non-dominated with respect to the whole 
feasible solutions. The idea of elitism was first introduced in [52] and is regarded as a milestone in the development 
of of MOEAs [18]. Since the effectiveness of this mechanism relies on the size of the external non-dominated set, 
an elite archive with limited size was suggested to store the truncated non-dominated sets [40, 52], so as to avoid 
the computational costs of maintaining all non-dominated sets. Nevertheless, restricting the size of archive reduces 
the quality of solutions; more efficient storages and algorithms are thus studied for unconstrained elite archives; see 
for example [27, 37, 44]. 

1.3 Aim and organization of this paper 

Due to the importance of maxima, a large number of algorithms for finding them in a given sample of points 
have been proposed and extensively studied in the literature, and many different design paradigms were introduced 
including divide-and-conquer, sequential, bucket or indexing, selection, and sieving; see [16] for a brief survey. 
Quite naturally, practical algorithms often merge more than one of the design paradigms for better performance. 

Despite the huge number of algorithms proposed in the literature, there is still need of simpler and practically 
efficient algorithms whose performance does not deteriorate too quickly in massive point samples as the number of 
maximal points grows, a property which we simply refer to as "scalable". This is an increasingly important property 
as nowadays massive data sets or data streams are becoming ubiquitous in diverse areas. 

Although for most practical ranking and selecting purposes, the notion of maxima is most useful when the 
number of maxima is not too large compared with the sample size, often there is no a priori information on the 
number of maxima before computing them. 

Furthermore, a general-purposed algorithm may in practice face the situation of data samples with very large 
standard deviation for their maxima. From known probabilistic theory of maxima (see [1] and the references 
therein), the expected number of maxima and the corresponding variance can in two typical random models grow 
either in 0((log n) d ^ 1 ) when the coordinates are roughly independent or in 0(n 1 ~ 1 l d ) when the coordinates are 
roughly negatively dependent, both O-terms here referring to large n, the sample size, and fixed d, the dimen- 
sionality. In particular, in the planar case, there can be y/n number of maxima on average for roughly negatively 
correlated coordinates, in contrast to log n for independent coordinates; see also [6, 33] for the "gap theorem" and 
[24] for a similar y/n vs log n effect (reflecting dependence or independence) on random Cartesian trees. Since the 
maximal points can be very abundant with large standard deviations, more efficient and more uniformly scalable 
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algorithms are needed. 

We propose in this paper two new techniques to achieve scalability: the first technique is to reduce the maxima- 
finding to a two-phase records-finding procedure, giving rise to a no-deletion algorithm, which largely simplifies 
the design and maintenance of the data structure used. The second technique is the introduction of bounding box 
in the corresponding tree structure for storing the current maxima, which reduces significantly the deterioration of 
efficiency in higher dimensions. The combined use of both techniques on k-d trees turns out to be very efficient, 
easily coded and outperforms many known efficient algorithms. Some preliminary results on the use of k-d trees 
for finding maxima of appeared in [17]. 

This paper is organized as follows. In the next section, we briefly describe some existing algorithms proposed 
in the diverse literature, focusing on the two most popular and representative paradigms: divide- and-conquer and 
sequential. Section 3 gives details of the new techniques, implementation on k-d trees, and diverse aspects of further 
improvements. A comparative discussion will also be given with major known algorithms. Analytic and empirical 
aspects of the performance of the algorithms will be discussed in Section 4. Finally, we apply our new algorithm 
to the problems of finding maximal layers and that of finding multiple longest common subsequence in Section 5, 
where the efficiency of our algorithm is tested on several data sets. 

Throughout this paper, Max(p) always denotes the maxima of the sequence of points p = {pi, . . . , p n }. 

2 Known maxima-finding algorithms — a brief account 

In view of the large amount of algorithms with varying characters appeared in the literature, it is beyond the scope 
of this paper to provide a full description of all existing algorithms. Instead, we give a brief account here on 
divide-and-conquer and sequential algorithms; see [16] and the references there for other algorithms. 

2.1 Divide-and-conquer algorithms 

Divide-and-conquer algorithms were first proposed by Kung et al. [43] with the worst-case time complexity of 
order n(log n) d ~ 2+Sd < 2 for dimensionality d > 2, where n is the number of points and 5 a ^ denotes the Kronecker 
delta function. Bentley [8] schematized a multidimensional divide-and-conquer paradigm, which in particular is 
applicable to the maxima-finding problem with the same worst-case complexity. Gabow et al. [29] later improved 
the complexity to 0(n(logn) d ~ 3 loglogn) for d > 4 by scaling techniques. Output-sensitive algorithms with 
complexity of order n(log(M + \^ d ~ 2+6d ' 2 were devised in [39], where M denotes the number of maxima. 
The typical pattern of most of these algorithms is as follow. 

Algorithm Divide-and-conquer 

//Input: A sequence of points p = {pi, . . . , p n } in R d 

//Output: Max(p) 

begin 

if n < 1 then return({pi, . . . , p n }) 

else return Filter-out-false-maxima(Divide-and-conquer({pi, . . . , P[ n / 2 j}), 

Divide-and-conquer({p L „ /2j+ i, . . . , p„}) 

end 

Here Filter-OUt-false-maxima(p, q) drops maxima in q that are dominated by maxima in p. 

These divide-and-conquer algorithms are generally characterized by their good theoretic complexity in the 
worst case, simple structural decompositions in concept but low competitiveness in practical and typical situations 
with sequential algorithms, although it is known that most divide-and-conquer algorithms have linear expected-time 
performance under the usual hypercube random model, or more generally when the expected number of maxima 
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is of order o(n 1 e ); see [22, 28]. Variants of them have however been adapted in the skyline and evolutionary 
computation contexts; see for example [45] for skylines and [37] for MOEAs. 

2.2 Sequential algorithms 

The most widely-used procedure for finding non-dominated points in multidimensional samples has the following 
incremental, on-line, one-loop pattern (see [9, 43]). 

Algorithm Sequential 

//Input: A sequence of points p = {pi, . . . , p n } in R d 

//Output: Max(p) 

begin 

M := {pi} //M : a data structure for storing the current maxima 
for i := 2 to n do 

if no point in M dominates pi then //updating M 

delete {q G M : q -< pj from M 

insert pi into M 

end 

The algorithm is a natural adaptation of the one-dimensional maximum-finding loop, which represents the very 
first algorithm analyzed in details in Knuth's Art of Computer Programming books [41]. It runs by comparing 
points one after another with elements in the data structure M, which stores the maxima of all elements seen so far, 
called left-to-right maxima or records; it moves on to the next point Pi+i if the new point pi is dominated by some 
element in M, or it removes elements in M dominated by the new point pi and accepts the new point p^ into M. 

For dimensions d>2, such a simple design paradigm was first proposed in [43] (with an additional pre-sorting 
stage for one of the coordinates) and the complexity was analyzed for d = 2 and d = 3. To achieve optimal worst- 
case complexity for d = 3, they used AVL-tree (a simple, balanced variant of binary search tree). The simpler 
implementation using a linear list (and without any pre-sorting procedure) was discussed first in the little known 
paper [35] and later in greater detail in [9], in particular with the move-to-front self-adjusting heuristic. 

The Sequential algorithm, also known as block-nested-loop algorithm [45], is most efficient when the number 
of maxima is a small function of n such as powers of logarithm, but deteriorates rapidly when the number of maxima 
is large. In addition to list employed in [9, 35] to store the maxima for sequential algorithms, many varieties of tree 
structures were also proposed in the literature: quad trees in [35, 44], R-trees in [42], and d-ary trees in [47]; see 
also [45]. But these algorithms become less efficient (in time bound and in space utilization) as the dimensionality 
of data increases, also the maintenance is more complicated. We will see that the use of k-d trees is preferable in 
most cases; see also [16] for the use of binary search trees for d = 2. 

3 A two-phase sequential algorithm based on k-d trees using bounding boxes 

We present our new algorithm based on the ideas of multidimensional non-dominated records, bounding boxes, and 
k-d trees. Further refinements of the algorithm will also be discussed. We then compare our algorithm with a few 
major ones discussed in the literature. 

3.1 The design techniques 

We introduce in this subsection multidimensional non-dominated records, k-d trees and bounding boxes, and will 
apply them later for finding maxima. In practice, each of these techniques can be incorporated equally well into 
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other techniques for finding maxima. 

3.1.1 Multidimensional non-dominated records 

Except for simple data structures such as list, the deletion performed in algorithm Sequential is often the most 
complicated step as it requires a structure re-organization after the removal of the dominated elements. It is then 
natural to see if there are algorithms avoiding or reducing deletions. 

Note that in the special case when d = 1, the two steps "deletion" and "insertion" in algorithm Sequential 
actually reduce to one, and the inserted elements are nothing but the records (or record-breaking elements, left-to- 
right maxima, outstanding elements, etc.). Recall that an element pj in the sequence of reals {pi, . . . , p ra } is called 
a record if pj is not dominated by any element in {pi, . . . , Pj-i}. 

The crucial observation is then based on extending the one-dimensional records to higher dimensions. 

Definition (<i-dimensional non-dominated records). A point pj in the sequence of points in M. d {pi, . . . , p n } is 
said to be a d-dimensional non-dominated record of the sequence {pi, . . . , p n } if pj is not dominated by pi for all 
1 < i < j- We also define pi to be a non-dominated record. 

Such non-dominated records are called "weak records" in [30], but this term seems less informative; see also 
[23] for a different use of records. For simplicity, we write, throughout this paper, records to mean non-dominated 
records when no ambiguity will arise. 

For convenience, write Rec(p) as the set of records of p = {pi, . . . , p n }. 

Lemma 1. For any given set of points {pi, . . . , p ra }, 

Max({pi, . . . ,p n }) = Rec(Rec({pi, . . . ,p n })), 
where {qi, . . . , qfc} := {q^, . . . , qi} denotes the reversed sequence. 

In words, if {qi , . . . , q^} represents the records of the sequence {pi , . . . , p Tt }, then the maxima of {pi , . . . , p n } 
is equal to the records of the sequence {q^, q^-i, . . . , qi}. 

Proof. We prove by contradiction. Assume that there are two points pi and p^ in the set 

Rec(Rec({pi, . . . ,p n })) 

such that pj >- pj. Ifi < j, then pj cannot be arecord and thus cannot be a member of the set Rec({pi, p n }), 
a contradiction. On the other hand, if % > j, then pj is a record and is included in the set Rec({pi, . . . , p„}), but 
then after the order being reversed, it cannot be a record since it is dominated by p«, again a contradiction. □ 

Another interesting property regarding the connection between records and maxima is the following. 

Corollary 1. In algorithm Sequential for finding maxima, the points Pi to be inserted in the for-loop are neces- 
sarily the records, while those deleted are records but not maxima. 

3.1.2 A two-phase sequential algorithm 

Lemma 1 provides naturally a two-phase, no-deletion algorithm for finding maxima: in the first phase, we identify 
the records, and in the second phase, we find the records of the reversed sequence of the output of the first phase 
(so as to remove the records that are not maxima); an example of seven planar points is given in Figure 1. In 
other terms, we perform only the insertion in algorithm Sequential in the first phase, postponing the deletion to be 
carried out in the second. 
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Figure 1: The maxima of the point sample {(2, 7), (3,9), (4,3), (5,8), (7,5), (6,4), (8,6), (9,2)} are marked by 
circles. After Phase 1, (2, 7), (4, 3) and (7, 5) are still left in the list though they are not maximal points. But after 
Phase 2, the resulting list contains all maximal points. 

The precise description of the algorithm is given as follows. Note that in the algorithm a list R is used to store 
the records and has to preserve their relative orders. 

Algorithm Two-Phase 

//Input: A sequence of points p = {pi, . . . , p n } 

//Output: Max(p) 

begin 

// Phase 1 

R:={pi} // R stores the non-dominated records 
k := 1 Ilk counts the number of records 
for i := 2 to n do 

if pi is not dominated by any point in R then 

k := k+ 1 

insert pi at the end of R // so as to retain the input order 

// After the for-loop, R = {p^ , . . . , p jk }, where j 1 < j 2 < ■ ■ ■ < j k - 
II Phase 2 

M := {pj k } II M stores the maxima 
for i := k — 1 downto 1 do 

if pj t is not dominated by any point in M then insert pj i in M 

end 

The correctness of Algorithm Two- Phase is guaranteed by Lemma 1. 

While the two-phase procedure may increase the total number of comparisons made, the real scalar comparisons 
made can actually be simplified since we need only to detect if the incoming element is dominated by some element 
in the list R, and there is no need to check the reverse direction that the incoming element dominates some element 
in R. Thus the code for the detection of dominance or non-dominance is simpler than that of algorithms performing 
deletions. Furthermore, for each vector comparison, it is not necessary to check all coordinates unless one element 
is dominated by the other. Briefly, the two-phase algorithm splits the comparisons made for checking dominance 
between elements in two directions. 
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3.1.3 The k-d trees 



The data structure k-d tree (or multidimensional binary search tree) is a natural extension of binary search tree for 
multidimensional data, where k denotes the dimensionality. For more notational convenience and consistency, we 
also write, throughout this paper, d as the dimensionality (but still use k-d tree instead of d-d tree). It was first 
invented by Bentley [7]. The idea is to use each of the d coordinates cyclically at successive levels of the tree as 
the discriminator and direct points falling in the subtrees. If a node holding the point r = (n, . . . , r^) in a k-d tree 
has the £-th coordinate as the discriminator, then, for any node holding the point w = (wi, . . . , Wd) in the subtrees 
of r, we have the relation we < re if w lies in the left-subtree of r, we > re if w lies in the right-subtree of r. The 
children of r then move on to the (I mod d) + 1-st coordinate as the discriminator. A two-dimensional example is 
given in Figure 2. 









ft p» 



Figure 2: The stepwise construction of a 2-d tree of four points. 



3.1.4 Bounding-boxes 

Bounding boxes are simple techniques in improving the performance of many algorithms, especially those dealing 
with intersecting geometric objects, and have been widely used in many theoretical and practical situations. 

The application of bounding boxes is straightforward. Let u r = (ui, . . . , Ud), where ui is the maximum among 
all the i-th coordinates of points in the subtree rooted at r. Then u r is defined to be the upper bound of the subtree 
rooted at r or simply the upper bound of the node r. Similarly, define v r = (vi,..., vj) to be the lower bound of 
the subtree rooted at r, where Vi is the minimum among all the i-th coordinates of points in the subtree rooted at r. 
A simple example of three-dimensional points is given in Figure 3. For simplicity, we also use the upper (or lower) 
bound of a node . The upper and lower bounds of a node constitute a bounding box for that subtree. 

Now if a point p is not dominated by u r , then obviously p is not dominated by any point in the subtree rooted 
at r. This means that all comparisons between p and all points in the subtree rooted at r can be avoided. Similarly, 
when searching for points in the subtree rooted at r that are dominated by p, we can first compare it with v r , and 
all comparisons between p with each node of that subtree can be saved if v r is not dominated by p. 

Note that although additional comparisons and spaces are needed for implementing the bounding boxes in 
maxima-finding algorithms, the overall performance is generally improved, especially, when dealing with samples 
with a large number of maxima. 
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Figure 3: Consider the subtree containing the points {(4, 3, 2), (9, 5, 4), (7, 2, 1), (5, 6, 3), (8, 9, 6), (2, 7, 9), (6, 2, 8)}. 
Then (9, 9, 9) and (2, 2, 1) are the upper bound and the lower bound of the subtree, respectively. 

Pa 



Ha 

Figure 4: Consider the k-d tree with six points pi,p2,. . . , P6 and a new point q. The upper bounds of the trees 
rooted at pi, p2 and P3 are ui, 112 and 113, respectively. To check if q is dominated by some point in the tree, the 
comparisons between q and subtrees rooted at p2 and p,3 can all be skipped since q is not dominated by 112 and 113. 

3.2 The proposed algorithm 

We give in this subsection our two-phase maxima-finding algorithm using k-d trees and bounding boxes. In this 
algorithm, we need only the upper bounds of the bounding boxes since in each phase we only detect if the new- 
coming element is dominated by existing records. An illustrative example is given in Figure 4. 

For implementation details, the records are stored, during the first phase, not only in a k-d tree but also in a list 
to preserve the order of the records. 

Algorithm Maxima 

//Input: A sequence of points p = {pi, . . . , p n } 
//Output: a k-d tree rooted at r consisting of Max(p) 
begin 

r := pi;u r := p x 

qi := pi // R := {qi}, the sequence of the records. 
k := 1 II k counts the number of records 
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for i := 2 to n do 

if (Dominated (r, = 0) then 

lnsert(r,l, Pi ); 
k := k + 1; q fc := p» 
// R = {qi, . . . , qfc} when z = n 
release the tree rooted at r 
r := q fc ;u r := q fe ; 
for i := k — 1 downto 1 do 

if (Dominated (r, q«) = 0) then lnsert(r, 1, q;) 



Dominated(r,p) 

//Input: A node r in a k-d tree and a point p 

0, if p is not dominated by any point in the subtree rooted at r 

1 , otherwise 



//Output: 
begin 



if (p -< r) then return 1 

if (r.left / and p -< u r .i e fi) then 

if (Dominated(r.left, p) = 1) then return 1 
if (r .right / and p -< u r r i g ht) then 

if (Dominated(r .right, p) = 1) then return 1 
return 



end 



lnsert(r,£, p) 
begin 

u r := max{u r , p} // update the upper bound 
compare the £-th component of p and that of r 
Case 1: pe > re and r. right / 

lnsert(r.right, 1 + i mod d, p) 
Case 2: pi > re and r. right = 

r.right := p; u r . right := p 
Case 3: pe < re and r. left ^ 

lnsert(r.left, 1 + £ mod d, p) 
Case 4: pe < re and r. left = 

r.left := p; u r .i eft := p 



Note that the upper bound of a subtree is updated after a new point is inserted. In the procedure Dominated, the 
"filtering role" played by the upper bounds may quickly reduce many comparisons. In practice, if a point p is not 
dominated by u r .i e f t (or u r r i g ht), then p is not dominated by any point in the subtree and the comparisons between 
p and the points of the subtree are all skipped. 
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3.3 Further improvements: sieving and pruning 

The algorithm Maxima is not on-line in nature since it requires two passes through the input. In this section, we 
discuss sieving and periodic pruning techniques, and present an on-line algorithm. 

Sieving The idea is to select an element (or several elements) as a good sieve (or "keeper"), so as to dominate as 
many as possible in-coming points, thus reducing the total number of comparisons made. This was first introduced 
in [9]. 

For our algorithm Maxima, many of the points inserted into the k-d tree may have limited power of dominating 
in-coming points. We can improve further Algorithm Maxima by choosing the input point with the largest L 1 -norm 
(which is the sum of the absolute values of all coordinates) to be the sieve and incorporate such a procedure as part 
of algorithm Maxima. The resulting implementation is very efficient, notably for samples with only a small number 
of maxima. 

A simple way to incorporate the maximum L 1 -norm point is to replace the line 

for i := 2 to n do 

in algorithm Maxima by the following 

s := pi // s = sieve 
for i := 2 to n do 

if (pi -ft s) then 

s . f s, if || s || x > Ip^ ; 
\ pi, if jsj-L < jpijj , 

where f-)^ denotes the L 1 -norm. Thus the sieving process is carried out only during the first phase. Other sieves 
can be considered similarly. 

Pruning In the first phase of Algorithm Maxi ma, the k-d tree may contain some nodes that are dominated by other 
nodes in the tree, and will only be removed in the second phase of the algorithm. In particular, if the dominated 
nodes are close to the root, then more comparisons may be made. It is thus more efficient to carry out an initial 
pruning of the k-d tree by removing dominated points in the tree after a sufficiently large number of records have 
been inserted (and still small compared with the total sample size). Such an early pruning idea can be implemented 
by running the following procedure. 

Algorithm Prune 

// only called once in the first for-loop of Algorithm Maxima 

// Assume R = {qi, . . . , q^} 

begin 

release the k-d tree 
r := q/^iir : = q^ 
for j := K — 1 downto 1 

if (Dominated (r, q,-) = 0) then lnsert(r, 1, q,) 

end 

We can call Prune when, say i = \n/\\ or i = [n s \, where % is the index in the first for-loop of algorithm 
Maxima. For example, we can take A = 10 and 5 = 2/3. Which choice is optimal is an interesting issue but 
depends on the practical implementations. Also one may consider the use of periodic pruning, but since pruning is 
a costly operation, we chose to apply it only once in our simulations. 
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An on-line algorithm On-line maxima-finding algorithms always retain the maxima of the all input points read 
so far and are often needed in many practical situations. A simple means to convert our algorithm Maxima into 
an on-line one is to add a procedure to delete the dominated elements in the k-d tree. The deletions can be made 
immediately after comparison with each in-coming element, which results in restructuring the whole k-d tree and 
may be very costly if the elements deleted are not near the bottom of a large tree. A simple way to perform the 
deletion of a node is to re-insert all its descendant nodes one by one, in the order inherited from the original input 
sequence. However, the procedure can be time-consuming and the resulting tree may be quite imbalanced. 

We introduce an on-line implementation by storing the current maxima in an extra list. In each iteration, we look 
for all points in the k-d tree that are dominated by the in-coming point p, mark them, and delete the corresponding 
elements from the extra list. The lower bounds of the bounding boxes are useful here. Recall v r = (v±, . . . , v^), 
where Vi is the minimum among all the i-th coordinates of points in the subtree rooted at r. When searching for 
those points in M that are dominated by p, we can skip checking the subtree of r if v r is not dominated by p. 

The on-line algorithm is given as follows. 

Algorithm On-Line-Maxima 

//Input: A sequence of points p = {pi, . . . , p n } 

//Output: M := the list containing Max(p) 

begin 

r := pi;u r := pi; v r := pi 
M:= { Pl } 
for i := 2 to n do 

if (Dominated (r, p^ = 0) then 

Delete(r,pi) 

lnsert(r,l,pi) 

M:=MU{ Pj } 

end 

Delete(r,p) 

//Input: A node r of a k-d tree and a point p 

//Output: a more compact M (all dominated points are removed) 

begin 

if (r -< p) then 

if (r is unmarked) then // The set of unmarked nodes is exactly M 
delete r from M 
mark r 

if (r.left / and v r .i eft -< p) then Delete(r.left, p) 
if (r .right / and v r r i g ht -< p) then Delete(r .right, p) 

end 

Note that the only difference between the procedure Insert of algorithm On-Line-Maxima and that of algorithm 
Maxima is that we need to update both the upper bounds and the lower bounds in the procedure lnsert(r, j, p) of 
algorithm On-Line-Maxima. 
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3.4 Comparative discussions 

We ran a few sequential algorithms and tested their performance under several types of random data, each with 
1000 iterations; the average values of the results are given in Tables 1 and 2. The points are generated uniformly 
and independently at random from a given region D, which is either a hypercube or a simplex. 

• list: a sequential algorithm using a linked list (see [9]); 

• (i-tree: a sequential algorithm using the <i-ary tree proposed in [47]; 

• quadtree: a sequential algorithm using quadtree (see [35, 49, 44]); 

• 2-phase: algorithm Maxima; 

• +prune: algorithm Maxima with an early pruning for % = n/10; 

• +sieve: algorithm Maxima with the max-L 1 -norm sieve; 

• +prune&sieve: algorithm Maxima with pruning for i = n/10 and the max-Z^-norm sieve. 

Table 1 shows evidently that our two-phase maxima-finding algorithms, whether coupling with sieving and 
pruning techniques, perform very well under random inputs from the d-dimensional hypercubes. They are efficient 
and uniformly scalable since the average number of scalar comparisons each point involved is gradually rising, in 
contrast to the fast increase of other algorithms. Note that, according to a result by Devroye [25], we expect that 
the average number of scalar comparisons each point involves tends eventually to d in each case. This is visible for 
d = 3 but less clear for higher values of d, as the convergence rate is very slow. Also the numbers in each column 
first increases as n increases and then decreases. 

On the other hand, although the asymptotic growth rate of the expected numbers of maxima in such cases 
are approximately (logn) d_1 /(d — 1)1 for large n and fixed d, the real values of \x n ^ for moderate d soon become 
large; for example, when d = 10 

{A*ioi,io}<=2,..,8 ~ {94, 765, 4 947, 25 113, 103 300, 357 604, 1 076 503}. 
These values were computed by the recurrence (see [1]) 

^ d =d~l £ H n- j) »n,j (d>2), 
l<j<d 

with n n ^i = 1 for n > 1, where the := J2i<i< n w& Harmonic numbers. They can also be estimated by 
the asymptotic approximations given in [1]. 

The situation is very similar (see Table 2) when the random samples are generated from the d-dimensional 
simplex, D = {x : Xj > 0, ^] 1<K(j Xi < 1} for which the expected numbers of maxima v n ^ are of order n L ~ x l d 
instead of (log n) d ~ 1 ; see [1]. In such cases, v n ^ grows even faster than n n ^. For example, when d = 6, 

{A»io*,6}i=2,...,8 ~ {95, 863, 7281, 57 858, 439 110, 3 223 774, 23 121 832}. 
These values were computed by the exact formula 

u d ~nT ^~ 1 V-iy r(n)r((i + 1)/rf) (d>2) 
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Table 1: The average numbers of scalar comparisons per input point when D = [0, l] d , where d £ {3, 4, 6, 10}. 

d = 3 



n 


list 


d-tree 


quadtree 


2-phase 


+prune 


+sieve 


+prune&sieve 


10 2 


11.40 


19.38 


13.58 


24.72 


23.23 


19.10 


18.82 


10 3 


11.01 


15.01 


11.38 


24.29 


20.81 


13.23 


12.43 


10 4 


8.28 


12.02 


9.41 


23.30 


17.70 


8.44 


7.69 


10 5 


6.36 


11.21 


8.50 


23.31 


15.70 


5.78 


5.30 


10 6 


5.01 


11.40 


8.07 


23.05 


13.75 


4.40 


4.09 


10 7 


4.24 


11.51 


7.91 


23.76 


12.50 


3.73 


3.54 


10 8 


3.88 


12.02 


7.67 


24.11 


11.39 


3.36 


3.25 


d = 4 


n 


list 


d-tree 


quadtree 


2-phase 


+prune 


+sieve 


+prune&sieve 


10 2 


26.96 


47.28 


30.29 


50.22 


50.05 


44.28 


44.78 


10 3 


37.41 


49.48 


31.53 


53.28 


51.07 


38.43 


37.76 


10 4 


32.48 


40.62 


26.80 


48.34 


43.94 


25.73 


24.79 


10 5 


22.36 


34.32 


22.60 


44.30 


37.75 


16.65 


15.78 


10 6 


14.69 


32.36 


20.66 


42.69 


33.00 


11.32 


10.61 


10 7 


10.08 


32.46 


19.47 


42.74 


29.87 


8.40 


7.80 


10 8 


8.40 


33.04 


19.05 


52.22 


28.88 


6.83 


6.08 


d = 6 


n 


list 


d-tree 


quadtree 


2-phase 


+prune 


+sieve 


+prune&sieve 


10 2 


75.44 


139.19 


74.32 


129.85 


131.41 


126.53 


128.20 


10 3 


228.69 


284.69 


130.37 


193.84 


193.27 


177.23 


177.44 


10 4 


384.86 


343.69 


149.75 


194.56 


194.05 


163.10 


163.17 


10 5 


404.74 


298.21 


131.41 


162.01 


161.27 


116.86 


117.40 


10 6 


310.75 


222.30 


104.53 


133.34 


131.66 


77.55 


78.68 


10 7 


190.08 


166.02 


86.63 


118.09 


112.34 


52.13 


52.65 


10 8 


100.77 


136.69 


74.97 


109.50 


98.93 


36.46 


36.36 



d= 10 



n 


list 


d-tree 


quadtree 


2-phase 


+prune 


+sieve 


+prune&sieve 


10 2 


137.56 


296.70 


132.72 


267.90 


270.67 


269.49 


272.22 


10 3 


1048.73 


1496.07 


458.30 


774.85 


777.16 


769.01 


771.42 


10 4 


5392.57 


4916.40 


1190.22 


1526.83 


1528.66 


1498.47 


1499.93 


10 5 


17779.34 


11463.01 


2201.99 


2126.49 


2132.18 


2062.42 


2067.98 


10 6 


38552.96 


18775.90 




2221.26 


2234.51 


2121.11 


2132.94 


10 7 


59207.23 


20769.36 




2023.64 


1844.68 


1931.37 


1750.01 


10 8 




19226.26 




1544.68 


1387.00 


1429.45 


1261.90 



14 



Table 2: The average numbers of scalar comparisons per input point when D is the <i-dimensional simplex, where 
d = 3, 4 and 6. 



d = 3 



n 


list 


d-tree 


quadtree 


2-phase 


+prune 


+sieve 


+prune&sieve 


10 2 


40.96 


62.81 


30.50 


57.68 


58.00 


57.87 


58.26 


10 3 


134.05 


112.71 


43.98 


82.03 


80.78 


81.34 


80.24 


10 4 


357.25 


203.97 


55.91 


95.20 


92.37 


93.78 


91.23 


10 5 


858.65 


402.18 


76.19 


105.64 


100.79 


104.10 


99.59 


10 6 


1957.22 


835.16 


126.45 


117.42 


107.53 


117.11 


107.60 


10 7 


4334.09 


1678.73 


161.25 


129.18 


106.81 


130.72 


108.50 


10 8 


9417.80 


3543.73 


331.25 


142.22 


116.74 


142.73 


116.98 



d = A 



n 


list 


d-tree 


quadtree 


2-phase 


+prune 


+ sieve 


+prune&sieve 


10 2 


81.74 


123.95 


57.61 


107.18 


108.36 


108.37 


109.55 


10 3 


441.09 


368.00 


117.09 


199.20 


199.35 


199.70 


199.87 


10 4 


1917.26 


910.44 


208.67 


287.21 


286.60 


287.09 


286.49 


10 5 


7316.79 


2230.39 


356.48 


373.86 


371.80 


373.60 


371.60 


10 6 


25786.00 


5948.65 


614.88 


474.28 


460.27 


474.84 


461.06 


10 7 


86609.63 


17071.62 


1302.10 


532.85 


487.16 


534.66 


489.15 


10 8 




53140.49 


4696.73 


651.13 


698.55 


646.59 


693.70 


d = 6 


n 


list 


d-tree 


quadtree 


2-phase 


+prune 


+sieve 


+prune&sieve 


10 2 


126.37 


221.77 


91.42 


175.93 


177.77 


177.79 


179.63 


10 3 


1096.21 


1175.40 


268.67 


467.27 


468.87 


468.96 


470.56 


10 4 


8284.26 


5660.90 


758.05 


993.25 


995.64 


994.77 


997.16 


10 5 


55200.49 


24332.05 


2178.38 


1849.37 


1856.49 


1850.86 


1858.01 


10 6 


331776.01 


93275.52 


6825.69 


3153.92 


3125.31 


3155.81 


3127.10 


10 7 




368306.29 


8418.26 


5090.63 


5029.78 


5092.54 


5031.71 


10 8 








7996.92 


7403.24 


7998.93 


7405.39 
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which follows from 



v n ,d = nP(xi is a maxima) 




n-l 



dx 



Jo 



by straightforward calculations, where T denotes the Gamma function. For similar details, see [1]. 

Unlike hypercubes where sieving is seen to be very helpful, the gain of sieving for random samples whose 
coordinates are roughly negatively correlated is marginal since there is no "omnipotently powerful" point; see 



A feature of the quadtree algorithm is that by its large amount of branching factors (2 d — 2), the position of a 
point in the tree is quickly identified, often after a few comparisons, and the bounding boxes are thus not helpful 
here. We also tested 2-phase quadtree and 2-phase d-tree algorithms, the improvement over the original algorithms 
is much more significant in d-trees than in quadtrees. In contrast, since k-d trees are binary, the use of the bounding 
boxes plays a crucial role in accelerating the performance of the algorithm. 

Note that the data collected in these two tables do not reflect directly the running time of each program. In terms 
of running time, our algorithms perform much better than the others. 

Simulations also suggested that our on-line algorithm is also reasonably efficient when compared with other 
algorithms. 

4 Average-case analysis of algorithm Maxima 

We derive in this section a few analytic results in connection with the performance of the algorithms we proposed 
in this paper. In general, probabilistic analysis of sequential algorithms for finding the maxima of random samples 
is very difficult due to the dynamic nature of the algorithms; see [25, 34] and the references therein. 

4.1 How many non-dominated records are there? 

The performance of Maxima depends heavily on the number of records, which in turn is closely related to the 
number of maxima. 

Theorem 1. Let R n denote the number of non-dominated records in a sequence {pi, . . . , p ra } of independent and 
uniformly distributed points from some region D in M. d . Let M n denote the maxima o/{pi, . . . , p n }. Then 



[6, 33]. 




(1) 



i=l 



Proof. By assumption, 



P(p„ € Max({pi,... 



Pn})) 



P(Pn € 



Max({pi,..., 



Pn}))- 



Thus 



n 



E(M n ) = ]T P ( Pi € Max({pi, . . . , p n })) = nP (p n G 



Max({pi,..., 



Pn}))- 
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Then we have 



E(-R n ) — y^El( p . i s a record) 
i=l 

n 

= ^P(p j 6Max({pi,... ) p,})) 



i=i 

n 



^E(M,) 



i=i z 



□ 



Since E(M n ) is usually of order n Q or (log n) 13 for some a, /? > (see [1, 2, 23]), if we assume that E(M n ) 
cn Q (logn) /3 , where c, /3 > and a G [0, 1], then, by (1), 



E(R n ) 



^n a (logn)^~ if0<a<l; 
a a 

i ^ri (logn)m ~7Tr logn ' ifa = ' 



where a n ~ 6 n means that a n /b n — > 1 as n — > oo. 

In the special case when the region D is the <i-dimensional hypercube [0, l] d , then it is also easily seen that 
the number of non-dominated records in random samples from [0, l] d is identically distributed as the number of 
maxima in random samples from [0, l] d+1 ; see [30]. 

Whichever the case, we always have 



n 

E(R n ) < E(M n ) ~ = 0(E(M n ) logn) 



i=i * 



This partly explains why our two-phase algorithm does not use much more comparisons and runs reasonably effi- 
cient. Also we see that the expected additional memory used for the k-d tree (and possibly the array) is at most a 
log n factor more than the expected number of maxima. 

4.2 Expected cost of the sieve algorithm 

Assume that pi, . . . , p n are sampled independently and uniformly at random from [0, l] d . Let s n be the point with 
the maximum L 1 -norm. Let 1 = (1, . . . , 1). 

d 

Lemma 2. For any c > 0, 

P (|s n - l|i < {cd\) 1 l d rr 1 / d {\ogn) 1 l d ^ > 1 - n~ c , 

for sufficiently large n. 
Proof. For < e < 1 

Pdsn-llj <e) = 1 — FdlPilU < d-e,l <i<n) 

= i-n £ " 



> 1 - e - £dn / dl . 
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Taking e = (cdl^^n 1 / a! (log n) l / d , we see that the last expression is equal to 1 — n c . Note that e < 1 if n is 
large enough. Indeed, nj log n > cdl suffices. □ 

Theorem 2. If the n points {pi, . . . , p„} are sampled independently and uniformly at random from [0, l] d , then 
the expected number of scalar comparisons used by our sieve algorithm satisfies dn + 0(n 1-1 / d (log n) d+1 / d ). 

Proof. The number of scalar comparisons used for the sieve is at most dn. We claim that the expected number of 
the extra comparisons is only 0(re 1_1 / d (log n) d+^/d). Let at = {2d\) l / d i- l / d (\ogi) 1 / d . For i large enough 

Pdsi-ii! <ai) >i-r 2 , 

by Lemma 2. If pj + i G [0, 1 — a,i] d and ||sj — 1^ < aj both hold, then pj + i -< Sj, that is, pj+i is filtered out. Thus, 
additional comparisons are required only when either p^+i [0, 1 — a,i] d or ||s; — 1^ > aj. If pj + i [0, 1 — a,i] d , 
then the additional comparisons used is bounded above by 0(i?j); if ||sj — 1^ > a^, then the extra comparisons are 
at most 0(i). Note that pj + i and Ri are independent. Thus, the expected number of the extra comparisons required 
in the for-loop of p^+i is less than 

P (pi+i t [0, 1 - a,] d ) 0(E(^)) + P (| Si - l|x > a*) O(i) 

= o(r 1 / <i (iogi) d+1 / d ) + o(r 1 ) 

since E(Ri) = O ((logi) d ). Summing over alH = 2, . . . , n, we obtain the required bound. □ 
4.3 Expected performance of Maxima when all points are maxima 

To further clarify the "scalability" of Maxima, we consider in this subsection the expected cost used by Maxima 
under the extreme situation when the <i-dimensional input points are sampled independently and uniformly from 
the the (d — 1) -dimensional simplex D = {x : Xi > 0,^i<i<d x j = 1}- Note that in the skyline context, an 
anti-correlated sample is often discussed, which is the (d — 1) -dimensional simplex with a specified error range. In 
that case, most but not necessarily all points are maxima. Since no deletion is involved in our algorithm Maxima, 
the difference between random samples from the (d — 1) -dimensional simplex and the anti-correlated sample is 
minor. 

When D is the (d — 1) -dimensional simplex, all points are maxima, and the time complexity of most algorithms 
such as the list algorithm (see [9]) is of order 0(M%) = 0(n 2 ). We show that the expected time complexity of 
Maxima is 0(n log n) when d = 2. 

Theorem 3. Assume that the d-dimensional points {pi, • • • , p n } are independently and uniformly distributed in 
the (d — 1) -dimensional simplex. The expected number of comparisons needed by algorithm Maxima for random 
samples is bounded above by 0(n log n) when d = 2. 

We leave open the analysis for the case when d > 3. 

Proof. Since all points in the sample are maxima, the expected number of comparisons used in the first phase and 
that in the second phase are the same. Thus, we focus on the first phase. 

Assume that {pi, . . . , p m } have been stored in a k-d tree. We consider the number of comparisons that p m +i 
may involve inside the two procedures of the for-loop: Insert and Dominated. The expected number of compar- 
isons used in Insert is of order 

0(the expected depth of the k-d tree) = 0(log m), 
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since the k-d tree is essentially a binary search tree (see [7]). 

We now estimate the expected number of comparisons used in Dominated. Since at most three vector compar- 
isons are involved in the procedure Dominated, we analyze the number of times T m the procedure Dominated is 
called. To complete the proof, we show that E(T m ) = 0(log m). 

Obviously, Dominated(r, p m +i) is called when p m +i -< u r . Thus, the number of times Dominated is called 
is equal to the number of nodes r such that p m +i -< u r . Let D r C D be the region that u r covers. Then the 
probability of the event p m +i -< u r conditioning on the k-d tree built from {pi, . . . , p m } equals \D r \ / \D\. Thus 

E(T m ) = 

where the summation runs over all nodes and the expectation is taken with respect to the k-d tree for {pi , . . . , p m }. 
To estimate J2 r we consider A r C D, the possible ranges induced by the nodes of the subtrees rooted at r. 
The precise definition is as follows. Define A r := D when r is the root. If r.left (r. right) represents the point at 
the root node of the left (right) subtree of r, respectively, then 




^r.left := A r n [0, lp- 1 x [0, xj) x [0, l] d ~i, 
bright := A r n [0, lp'- 1 x [ Xj , 1] x [0, l] d -J, 



(j = l,...,d), 



where d = 2, the j-th coordinate is the discriminator of node r and r = (x\, X2, ■ • ■ , x^)- 

Since the union of A r in the same level of the k-d tree is at most D and D r C A r (see Figure 5), we have 



E (.'/'„,) < j-^j- E ! ) J.l r ) < the expected depth of the k-d tree - 0(!or///). 
Note that A r is determined by r and its ancestors; in contrast, D r is determined by r and its offsprings. 



Figure 5: A possible configuration of A r and D r for d = 2. 

For d > 3, the expected time-complexity remains open. However, simulations suggest that for fixed d the 
expected time be of order 0(n(log n) c ) for some c > 0; see Figure 6. On the other hand, for fixed n and increasing 
d, the expected number of comparisons appears to be of order O (dnlog n). 

One way of seeing why our algorithm suffers less from the so-called "curse of dimensionality" than other 
algorithms in such extreme cases is as follows. As is obvious from the proof of Theorem 3, the time complexity 
is proportional to the order of |D r |/| A r \. The more slender A r is, the larger |D r |/|A r | becomes. All four possible 
patterns of A r for d = 3 are shown in Figure 7. The slenderness does not seem to worsen rapidly as there is some 
sort of counter-balancing process at play; see Figure 7. 
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Figure 6: Simulation results of the total number of times the procedure Dominated is called for in the first phase 
for d = 3, 4, 5, ... , 10 and n = 2 K for k from 10 to 20. Here we plot ^f=r ' against k = log 2 n. 

ft lOtio ft 



Figure 7: Here d = 3. All four possible configurations of A r are shown on the left (the four smaller, We can see 
how A r tends to keep from getting too slender by the interaction of x-axis, y-axis and z-axis. Take the leftmost 
region (graph g{) for instance. Whenever A r is split less evenly by ir-axis (graph gi), later splittings along y-axis 
or along z-axis tend to counterbalance the effect caused by x-axis (graph gs). 

5 Applications 

In this section, we apply algorithm Maxima to find successively the maximal layers and to search for the longest 
common subsequence of multiple sequences, respectively. In both cases, our algorithms generally achieve better 
performance. 

5.1 Maximal layers 

The problem is to split the input set of points p into layers according to maxima. Let denote the k-th maximal 
layer of p. Then Li = Max(p) and 



L fc := Max p \ (J Lj , for k>2. 
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Maximal layers have been widely applied in multi-objective optimization problems, and algorithms with 0(n log n)- 
time complexity were known for rinding the two- and three-dimensional maximal layers; see [12, 15]. 

By identifying the first few layers of maxima to preserve the so-called elitism, Srinivas and Deb [48] proposed 
a multi-objective evolutionary algorithm, called non-dominated sorting genetic algorithm (NSGA). This algorithm 
was later improved and called NSGA-II [21], which reduces the worst-case time complexity from 0(dn 3 ) to 0(dn 2 ) 
and soon became extremely popular. Omitting the details of the corresponding genetic algorithms, the NSGA-II 
algorithm [21] for finding the maximal layers can be extracted and summarized in the following two steps. 

Step 1: For each point pi, compute the number of points that dominate it rij := |{pj : Pi -< Pj}\ (rii will be 
referred to as the rank of the point p«) and the set of points dominated by it S« := {p^ : pj -< pi}. 

Step 2: Then the maximal layers can be determined by rij and Sj as follows. The first layer Li contains the points 
with zero rank. For k > 2, remove L^_i and update the rank rij by using Sj. Then, is the set of the points 
with zero rank among all points that remain. 

The running time is obviously 0{dn 2 ) since all pairs of points are compared. 

A straightforward way to compute the maximal layers is to find successively the maxima after the removal of 
each layer. 

Algorithm Peeling 

//Input: A sequence of points {pi, . . . , p„} 
//Output: Maximal layers Li, L2, . . • 
begin 

fc:=0;q:={pi,...,p n } 
while (|q| > 0) 
k:= k + l 

L fc := Find-Maxima (q) 

q := q - L fc 

end 

Algorithm Peeling is simple and efficient in average situations, even though the worst-case complexity is 
0(n 3 ). Any maxima-finding algorithm can be used for the procedure Find-Maxima(q). To study the average 
behavior of algorithm Peeling, we compare two procedures for Find-Maxima: algorithm Maxima and algorithm 
Naive. Algorithm Naive finds maxima using pairwise comparisons. 

Algorithm Naive 

//Input: A set of points q = {qi, . . . , q ra } 
//Output: M = Max(q). 
begin 

M:={} 

for i := 1 to n do 

for j := 1 to n do 

if (i / j and q^ q^ ) then break 

if (j = n) then insert q^ into M 

end 
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Figure 8: Simulation of Deb's algorithm, and the peeling method with algorithm Naive and algorithm Maxima, 
respectively. We compare the number of scalar comparisons used in the algorithms. Here the sample size n = 
10 2 , 10 3 , 10 4 and the points are generated uniformly from [0, l] d for d = 2, 3, . . . , 10. 

Theorem 4. If pi, . . . , p n are independently and uniformly sampled from any given region in ~R d , then the expected 
running time of algorithm Peeling using algorithm Naive is O (n 2 \og(K + 1)), conditioned on the number of 
maximal layers K. 

Proof. Consider the event that the total number of layers is K and the number of points in the z-th layer L, is 4 for 

1 < i < K. 

We now fix k. At the moment of computing the total number of remaining points is equal to N k := Yld=k ^- 
If a point p is in the i-th layer for % > k, then the number of points that dominate p is at least i — k. Thus, the 
expected number of comparisons that p involves in the loop for computing the k-th layer maxima is upper bounded 
by 

< f N k , ifi = k, 

~ \ N k /(i-k), if* > fc, 

since the remaining points preserve the randomness. Summing over all p and k, we obtain the upper bound for the 
expected number of comparisons used 

K K K K i-1 

k=l k=li=k+l i=2 k=l 

< n 2 + n 2 (l+logK). 

This completes the proof. □ 
Note that the proof also extends to more general non-uniform distributions. 

We compare the numbers of scalar comparisons used by the following three algorithms for finding the maximal 
layers: Deb et al.'s algorithm [21], algorithm Peeling using Maxima, and algorithm Peeling using Naive. The 
simulation results are shown in Figure 8. Note that we reverse the order of the remainder after a layer is found to 
make the algorithm more efficient. It is clear that algorithm Peeling using Maxima outperforms generally the other 
two, especially for higher dimensional samples in large data sets. 

5.2 The multiple longest common subsequence problem 

Given two or more strings (or sequences), the longest common subsequence (LCS for short) problem is to determine 
the longest common subsequence obtained by removing zero or more symbols from each string. For example, if 
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si = aabbc and S2 = abac then LCS(s±, S2), the LCS of si and S2, is abc. The LCS of sequences is widely used 
in computational biology, notably in DNA and protein sequence analysis. 

Various algorithms for computing an LCS between two strings were derived in the literature, but much fewer 
algorithms are devoted to the LCS of more than two strings. Hakata and Imai [36] proposed a method for solving 
efficiently the multiple LCS problem. The method is essentially based on minima-finding. 

Let si = a\a2 ■ ■ ■ a n and S2 = 6162 ■ • • b m be two strings. We say that is a match if a\ = bj. Given two 
matches and (12, 32)- If h < h and ji < 32 then 

LCS(ai ■ • • a^, 61 • ■ ■ bj x ) < LCS(ai ■ ■ ■ Oj 2 ,6i • • ■ bj 2 ). 

Thus, finding the LCS can be roughly regarded as finding the maximal layers of all possible matches. However, 
the number of matches is usually too large. The approach proposed in [36] is to find the layers one after another 
as follows. Assume we have found the k-th layer, C^, then the (k + l)-st layer is the minima of all successors of 
Cfc, where a match (12,32) is called a successor of another match if i\ < 12 and j\ < 32 and there is no 

match between them. The minima-finding algorithm proposed in [36] is an improvement over algorithm Naive. 
The algorithm runs as follows. 

Algorithm Hakata-lmai 

//Input: A set of points q = {qi, . . . , q n } 
//Output: M contains minima of q 
begin 

M := {} 

for i := 1 to n do 

if q« is unmarked then 
for 3 := 1 to n do 

if q., is unmarked then 

if (q«-< q^) then mark qj 
if (<\j< q«) then mark q« 
if q« is unmarked then insert q» into M 

end 

This algorithm is similar to the list algorithm if we consider node-marking as a substitute of node-deletion. 
We compare the performance of Hakata-lmai and Maxima for the number of strings 3, 5, 7 and alphabet sizes 
4, 20. See the experimental results in Figure 9 where the improvement achieved by our algorithm is visible. 
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